Multiview detection (MVD) is highly effective for occlusion reasoning in crowded environments. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and to propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address these concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, and varying numbers of cameras, and (b) we discuss the properties that bring generalization to MVD and propose a barebones model to incorporate them. We perform a comprehensive set of experiments on the WildTrack, MultiviewX, and GMVD datasets to motivate evaluating the generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/gmvd
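A minimal sketch of the kind of camera-agnostic fusion such generalization calls for: per-view features are warped to a common ground-plane grid and averaged, so the fusion step does not depend on the number or placement of cameras. Module names, shapes, and the average-pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AvgPoolMultiViewFusion(nn.Module):
    """Hypothetical fusion head: per-camera features are warped to a shared
    ground-plane grid and averaged, so any number of cameras can be used."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)  # occupancy logits

    def forward(self, per_view_feats, homographies, grid_size=(120, 360)):
        # per_view_feats: list of (C, H, W) tensors, one per camera
        # homographies:   list of (3, 3) ground-plane -> image homographies
        warped = [warp_to_ground(feat, H, grid_size)
                  for feat, H in zip(per_view_feats, homographies)]
        fused = torch.stack(warped, dim=0).mean(dim=0)    # camera-count agnostic
        return self.head(fused.unsqueeze(0))              # (1, 1, Hg, Wg) occupancy map

def warp_to_ground(feat, H, grid_size):
    """Sample the feature map at the image locations of each ground-plane cell."""
    Hg, Wg = grid_size
    ys, xs = torch.meshgrid(torch.arange(Hg), torch.arange(Wg), indexing="ij")
    ground = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (Hg, Wg, 3)
    img = ground @ H.T                                    # project grid cells into the image
    img = img[..., :2] / img[..., 2:3].clamp(min=1e-6)
    C, Hf, Wf = feat.shape
    norm = torch.stack([img[..., 0] / (Wf - 1), img[..., 1] / (Hf - 1)], dim=-1) * 2 - 1
    return F.grid_sample(feat.unsqueeze(0), norm.unsqueeze(0), align_corners=True)[0]
```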
Vaccines against viruses have been the need of the hour since their early days. However, it is hard to distribute vaccines efficiently (and on time) to all corners of a country, especially during a pandemic. Given the vastness of the population, diversified communities, and the demands of a smart society, effectively optimizing the vaccine distribution strategy of any country or state is an important task. Although abundant data (big data) from various vaccine administration sites can be mined to gain valuable insights about mass vaccination drives, very few attempts have been made to revolutionize traditional mass vaccination campaigns to mitigate the socio-economic crises of pandemic-afflicted countries. In this paper, we bridge this gap in studies and experimentation. We collect publicly available daily vaccination data and analyze it carefully to generate meaningful insights and predictions. We propose a novel framework that leverages supervised learning and reinforcement learning (RL), which we call VacciNet, and which learns to predict the demand for vaccination in a state of a country and to suggest optimal vaccine allocation for that state at minimal cost of procurement and supply. At present, our framework is trained and tested with vaccination data from the USA.
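As a toy illustration of the allocation idea only, not the paper's actual formulation, the sketch below runs tabular Q-learning over a discretized shipping decision, rewarding low unmet demand and low procurement cost. All demand levels, actions, and cost terms are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all numbers invented): 3 predicted demand levels x 4 shipping actions.
DEMAND_LEVELS = [100, 500, 1000]       # predicted weekly demand (doses)
ACTIONS = [0, 250, 500, 1000]          # doses shipped to the state
COST_PER_DOSE, PENALTY_PER_UNMET = 1.0, 5.0

def reward(demand, shipped):
    unmet = max(demand - shipped, 0)
    return -(COST_PER_DOSE * shipped + PENALTY_PER_UNMET * unmet)

Q = np.zeros((len(DEMAND_LEVELS), len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.9, 0.2

for episode in range(5000):
    s = rng.integers(len(DEMAND_LEVELS))                 # current demand level
    a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(Q[s].argmax())
    r = reward(DEMAND_LEVELS[s], ACTIONS[a])
    s_next = rng.integers(len(DEMAND_LEVELS))            # demand evolves (here: at random)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

for s, d in enumerate(DEMAND_LEVELS):
    print(f"predicted demand {d:4d} -> ship {ACTIONS[int(Q[s].argmax())]} doses")
```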
6-DoF visual localization systems use principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images against a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and boost performance. However, despite gains on typical recall@0.25m-style metrics, these systems still have limited utility for real-world applications such as autonomous vehicles because of their 'worst' areas of performance: the locations where they provide insufficient recall at a given error tolerance. Here we investigate the utility of 'place-specific configurations', where the map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate substantially improved worst-case localization performance compared to using off-the-shelf pipelines, minimizing the percentage of the dataset that has poor recall at a certain error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, in which a fleet of AVs regularly traverses known routes.
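The core mechanism, as described, is a lookup: the map is pre-segmented into places, and each place stores its own configuration (here, which camera of the multi-camera rig to localize with). The sketch below is a hypothetical illustration of that dispatch step; the place IDs, camera names, and the retrieval/pose callables are placeholders, not the benchmarked pipeline.

```python
# Hypothetical place-specific camera selection for a hierarchical localization pipeline.
PLACE_CONFIG = {
    # place_id -> camera chosen offline for best recall in that map segment
    0: "front_left",
    1: "front_right",
    2: "rear",
}
DEFAULT_CAMERA = "front_left"

def localize(query_images, coarse_place_retrieval, estimate_pose):
    """query_images: dict camera_name -> image; the two callables stand in for
    the image-retrieval and pose-estimation stages of an off-the-shelf pipeline."""
    place_id = coarse_place_retrieval(query_images)          # which map segment are we in?
    camera = PLACE_CONFIG.get(place_id, DEFAULT_CAMERA)      # place-specific configuration
    return estimate_pose(query_images[camera], place_id)     # 6-DoF pose from that camera only
```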
As a seminal tool in self-supervised representation learning, contrastive learning has gained increasing attention in recent years. In essence, contrastive learning aims to leverage pairs of positive and negative samples for representation learning, which relates to exploiting neighborhood information in a feature space. By investigating the connection between contrastive learning and neighborhood component analysis (NCA), we provide a novel stochastic nearest-neighbor viewpoint of contrastive learning and subsequently propose a series of contrastive losses that outperform the existing ones. Under our proposed framework, we show a new methodology for designing integrated contrastive losses that simultaneously achieve good accuracy and robustness on downstream tasks. With the integrated framework, we achieve up to a 6% improvement in standard accuracy and a 17% improvement in robust accuracy.
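A minimal sketch of the stochastic-nearest-neighbor view the abstract refers to: as in NCA, the probability of selecting the positive as an anchor's neighbor is a softmax over similarities, and the loss maximizes that probability. This is standard InfoNCE-style machinery written to expose the NCA connection, not the paper's proposed family of losses.

```python
import torch
import torch.nn.functional as F

def nca_style_contrastive_loss(anchors, positives, temperature=0.1):
    """anchors, positives: (N, D) embeddings; row i of `positives` is the
    positive for row i of `anchors`; all other rows act as negatives.
    As in NCA, p(j | i) = exp(sim_ij / t) / sum_k exp(sim_ik / t);
    the loss maximizes p(i | i), the chance of picking the true neighbor."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)        # -log p(positive | anchor), averaged

# usage: loss = nca_style_contrastive_loss(f(x_view1), f(x_view2))
```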
We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work on implicit neural scene representations, in which 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields, upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method that produces a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates quality comparable to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentation that requires only 2D supervision for training and no semantic input at inference time on novel scenes. We encourage the readers to visit the project website.
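A sketch of how 2D-only supervision can train a 3D semantic model in this setting: sample points along each ray, query a frozen density field for volume-rendering weights, query the semantic model for per-point class logits, and composite the logits into a 2D semantic map that is compared against the posed ground-truth map. The function names and compositing details below are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn.functional as F

def render_semantics(ray_origins, ray_dirs, density_fn, semantic_fn,
                     near=0.5, far=6.0, n_samples=64):
    """ray_origins, ray_dirs: (R, 3). density_fn(points) -> (R, S) sigma,
    semantic_fn(points) -> (R, S, K) class logits. Returns (R, K) rendered logits."""
    t = torch.linspace(near, far, n_samples)                                  # (S,)
    pts = ray_origins[:, None, :] + t[None, :, None] * ray_dirs[:, None, :]   # (R, S, 3)

    sigma = density_fn(pts)                                  # (R, S) from the frozen density field
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                  # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                  # (R, S) volume-rendering weights

    logits = semantic_fn(pts)                                # (R, S, K) per-point class logits
    return (weights[..., None] * logits).sum(dim=1)          # (R, K) composited logits

def semantic_loss(rendered_logits, gt_labels):
    # gt_labels: (R,) integer class ids read from the posed 2D semantic map
    return F.cross_entropy(rendered_logits, gt_labels)
```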
A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g., textured meshes, or implicit representations, e.g., radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method that processes posed or unposed RGB images of a new region, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To compute the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for this paper. Furthermore, we demonstrate that SRT supports interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
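A minimal sketch of the encode-once, decode-per-ray pattern described above: a transformer encoder turns patch tokens from all input images into a set-latent scene representation, and a lightweight decoder cross-attends from ray queries into that set to predict colors. Dimensions, the patchification, and the ray encoding are simplified stand-ins, not the published SRT architecture.

```python
import torch
import torch.nn as nn

class TinySRT(nn.Module):
    """Illustrative encoder/decoder around a set-latent scene representation."""

    def __init__(self, patch_dim=3 * 8 * 8, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)     # pools info across all views
        self.ray_embed = nn.Linear(6, d_model)                        # (origin, direction) per query ray
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, patches, rays):
        # patches: (B, n_views * n_patches, patch_dim) flattened from all input images
        # rays:    (B, n_rays, 6) query rays for the novel view
        scene = self.encoder(self.patch_embed(patches))               # set-latent scene representation
        q = self.ray_embed(rays)
        feats, _ = self.cross_attn(q, scene, scene)                   # each ray attends into the scene
        return torch.sigmoid(self.to_rgb(feats))                      # (B, n_rays, 3) colors

# usage (shapes only): TinySRT()(torch.rand(1, 5 * 144, 192), torch.rand(1, 1024, 6))
```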
Camera and lidar are important sensor modalities for robotics in general and self-driving cars in particular. The sensors provide complementary information, offering an opportunity for tight sensor fusion. Surprisingly, lidar-only methods outperform fusion methods on the main benchmark datasets, suggesting a gap in the literature. In this work, we propose PointPainting: a sequential fusion method to fill this gap. PointPainting works by projecting lidar points into the output of an image-only semantic segmentation network and appending the class scores to each point. The appended (painted) point cloud can then be fed to any lidar-only method. Experiments show large improvements on three different state-of-the-art methods, PointRCNN, VoxelNet and PointPillars, on the KITTI and nuScenes datasets. The painted version of PointRCNN represents a new state of the art on the KITTI leaderboard for the bird's-eye view detection task. In ablation, we study how the effects of painting depend on the quality and format of the semantic segmentation output, and demonstrate how latency can be minimized through pipelining.
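The sequential fusion step described above is simple enough to sketch directly: project each lidar point into the image with the camera calibration, read the per-class segmentation scores at that pixel, and append them to the point's features. Variable names and the calibration convention are assumptions for illustration; the real pipeline also handles multiple cameras and the dataset-specific coordinate frames.

```python
import numpy as np

def paint_points(points, seg_scores, lidar_to_cam, cam_intrinsics):
    """points:        (N, 4) lidar points (x, y, z, reflectance)
    seg_scores:    (H, W, C) per-pixel class scores from an image segmentation net
    lidar_to_cam:  (4, 4) extrinsic transform; cam_intrinsics: (3, 3)
    returns        (M, 4 + C) painted points that project inside the image."""
    H, W, C = seg_scores.shape
    xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)  # homogeneous coords
    cam = (lidar_to_cam @ xyz1.T)[:3]                       # (3, N) points in the camera frame
    in_front = cam[2] > 0.1                                 # keep only points in front of the camera
    points, cam = points[in_front], cam[:, in_front]
    uvw = cam_intrinsics @ cam                              # pinhole projection
    u = (uvw[0] / uvw[2]).round().astype(int)
    v = (uvw[1] / uvw[2]).round().astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # append the class scores of the pixel each point lands on
    return np.concatenate([points[valid], seg_scores[v[valid], u[valid]]], axis=1)
```

The painted array keeps the original point features, so it can be fed to any lidar-only detector without further changes to that detector's input pipeline.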
Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2-4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
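A stripped-down sketch of the pillar-encoding idea: bucket points into an x-y grid, run a small shared per-point network, max-pool the point features within each pillar, and scatter the pillar features into a 2D pseudo-image for a standard 2D detection backbone. Grid size, feature dimensions, and the per-point network are simplified assumptions rather than the published encoder.

```python
import torch
import torch.nn as nn

class TinyPillarEncoder(nn.Module):
    """Illustrative pillar feature net: per-point MLP + max pool per pillar,
    scattered into a (C, H, W) pseudo-image for a 2D convolutional backbone."""

    def __init__(self, in_dim=4, feat_dim=64, grid=(100, 100), cell=0.8, x_min=-40.0, y_min=-40.0):
        super().__init__()
        self.pointnet = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.grid, self.cell, self.x_min, self.y_min = grid, cell, x_min, y_min

    def forward(self, points):
        # points: (N, 4) lidar points (x, y, z, reflectance)
        H, W = self.grid
        ix = ((points[:, 0] - self.x_min) / self.cell).long().clamp(0, W - 1)
        iy = ((points[:, 1] - self.y_min) / self.cell).long().clamp(0, H - 1)
        pillar_id = iy * W + ix                                  # (N,) flat pillar index

        feats = self.pointnet(points)                            # (N, C) per-point features
        canvas = feats.new_zeros(H * W, feats.size(1))
        # max-pool point features into their pillar (scatter-max via index_reduce_)
        canvas.index_reduce_(0, pillar_id, feats, reduce="amax", include_self=False)
        return canvas.t().reshape(feats.size(1), H, W)           # (C, H, W) pseudo-image

# usage: pseudo_image = TinyPillarEncoder()(torch.rand(2048, 4) * 80 - 40)
```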